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Abstract 


We introduce a novel approach for estimating Latent Dirichlet Allocation (LDA) parameters 
from collapsed Gibbs samples (CGS), by leveraging the full conditional distributions over 
the latent variable assignments to efficiently average over multiple samples, for little more 
computational cost than drawing a single additional collapsed Gibbs sample. Our approach 
can be understood as adapting the soft clustering methodology of Gollapsed Variational 
Bayes (CVBO) to CGS parameter estimation, in order to get the best of both techniques. Our 
estimators can straightforwardly be applied to the output of any existing implementation of 
CGS, including modern accelerated variants. We perform extensive empirical comparisons 
of our estimators with those of standard collapsed inference algorithms on real-world data 
for both unsupervised LDA and Prior-LDA, a supervised variant of LDA for multi-label 
classihcation. Our results show a consistent advantage of our approach over traditional CGS 
under all experimental conditions, and over CVBO inference in the majority of conditions. 
More broadly, our results highlight the importance of averaging over multiple samples in 
LDA parameter estimation, and the use of efficient computational techniques to do so. 


Keywords: Latent Dirichlet Allocation, topic models, unsupervised learning, multi-label 
classification, text mining, collapsed Gibbs sampling, CVBO, Bayesian inference 
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1. Introduction 


Latent Dirichlet Allocation (LDA) ( |Blei et ^ 2003) is a probabilistic model which organizes 
and summarizes large corpora of text documents by automatically discovering the semantic 
themes, or topics, hidden within the data. Since the model was introduced by|Blei et al. 


(2003), LDA and its extensions have been successfully applied to many other data types 
and application domains, including bioinformatics (Zheng et ah, 2006), computer vision 


(Cao and Fei-Fei, 2007), and social network analysis (Zhang et ah, 2007), in addition to 


text mining and analytics (AlSumait et ah, 2008). It has also been the subject of numerous 


adaptations and improvements, for example to deal with supervised learning tasks (Blei and 


McAuliffe, 2007 

Ramage et al., 2009 

Zhu et al., 2009) 

time and memory efficiency issues 

(Newman et al. 

2007 Porteous et al. 

2008 

Yao et al. 

2009 

) and large or streaming data 

settings ( 

Gohr et al. 

2009 

Hoffman et al. 

2010b; Rubin et ah, 2012 Foulds et al. 2013). 


When training an LDA model on a collection of documents, two sets of parameters are 
primarily estimated: the distributions over word types cf) for topics, and the distributions over 
topics 9 for documents. After more than a decade of research on LDA training algorithms, 
the Collapsed Gibbs Sampler (CGS) (Griffiths and Steyvers, 2004) remains a popular choice 
for topic model estimation, and is arguably the de facto industry standard technique for 
commercial and industrial applications. Its success is due in part to its simplicity, along with 


the availability of efficient implementations (McCallum 2002a Smola and Narayanamurthy 


of 


2010) leveraging sparsity (Yao et al. 2009 Li et al. 2014) and parallel architectures (Newman 


et ah, 2007). It exhibits much faster convergence than the naive uncollapsed Gibbs sampler 


Pritchard et al. (2000), as well as the partially collapsed Gibbs sampling scheme (Asuncion 


2011). The CGS algorithm marginalizes out parameters (p and 9, necessitating a procedure 


to recover them from samples of the model’s latent variables z. Griffiths and Steyvers (2004) 


proposed such a procedure, which, while very successful, has not been revisited in the last 
decade to our knowledge. Our goal is to improve upon this procedure. 


In this paper, we propose a new method for estimating topic model parameters p 
and 9 from collapsed Gibbs samples. Our approach approximates the posterior mean of 
the parameters, which leads to increased stability and statistical efficiency relative to the 
standard estimator. Grucially, our approach leverages the distributional information given 
by the form of the Gibbs update equation to implicitly average over multiple Markov chain 
Monte Carlo (MCMC) samples, with little more computational cost than that which is 
required to compute a single sample. As such, our work is focused on the realistic practical 
scenario where we can afford a moderate burn-in period, but cannot afford to compute and 
average over enough post burn-in MCMC samples to accurately estimate the parameters’ 
posterior mean. In our experiments, this requires several orders of magnitude further 
sampling iterations beyond burn-in. 


Our use of the full conditional distribution of each topic assignment is reminiscent of 
the CVBO algorithm for collapsed variational inference (Asuncion et al., 2009). The update 
equations of CVBO bear resemblance to those of CGS, except that they involve deterministic 
updates of dense probability distributions, while CGS draws sparse samples from similar 
distributions, with corresponding trade-offs in execution time and convergence behavior, 
as we study in Section 5.5 Our approach can be understood as adapting this dense soft 


clustering methodology of the CVBO algorithm to CGS parameter estimation, leveraging 
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training 

phase 

parameter 

recovery 

update 

rule 

inference 

algorithm 

memory 
per token 

GVBO 

dense 

dense 

deterministic 

variational Bayes 

0{K) 

GGS 

sparse 

sparse 

random 

MGMG 

0(1) 

GGSp 

sparse 

dense 

random 

MGMG 

0(1) 


Table 1: Properties of CVBO, CGS, and our proposed CGSp method. 


the corresponding uncertainty information, as in GVBO, but within a Markov chain Monte 
Garlo framework, in order to draw on the benefits of both techniques. Our approach does 
not incur the memory overhead of GVBO, relative to GGS. The properties of GVBO, GGS, 
and our proposed estimators are summarized in Table 

It is important to note that we do not modify the GGS algorithm, and thus do not affect its 
runtime or memory requirements. Rather, we modify the procedure for recovering parameter 
estimates from collapsed Gibbs samples. After running the standard GGS algorithm (or a 
modern accelerated approximate implementation) for a sufficient number of iterations, we 
obtain the standard document-topic and topic-word count matrices. At this point, instead 
of using the standard GGS parameter estimation equations to calculate 9 and cf, we employ 
our proposed estimators. As a result, our estimators can be plugged in to the output of any 
GGS-based inference method, including modern variants of the algorithm such as Sparse 


LDA 

Yao et al. 

et al. 

2015 

), or \ 


2009), Metropolis-Bastings-Walker (Li et ah, 2014), LightLDA (Yuan 


2016), allowing for easy and wide adoption of our 


approach by the research community and in industry applications. Our algorithm is very 
simple to implement, requiring only slight modifications to existing GGS code. Popular 


LDA software packages such as MALLET (McGallum 2002a) and Yahoo_LDA (Smola and 


Narayanamurthy, 2010) could straightforwardly be extended to incorporate our methods. 


leading to improved parameter estimation in numerous practical topic modeling applications. 
In Section 5.7, we provide an application of our estimators to both MALLET’s Sparse 


LDA implementation, and to WarpLDA. When paired with a sparse GGS implementation, 
our approach gives the best of two worlds: making use of the full uncertainty information 
encoded in the dense Gibbs sampling transition probabilities in the final parameter recovery 
step, while still leveraging sparsity to accelerate the CGS training process. 

In extensive experiments, we show that our approach leads to improved performance 
over standard GGS parameter estimators in both unsupervised and supervised LDA models. 
Our approach also outperforms the GVBO algorithm under the majority of experimental 
conditions. The computational cost of our method is comparable that of a single (dense) 
GGS iteration, and we show that even when a state-of-the-art sparse implementation of 
the GGS training algorithm is used, this is a price well worth paying. Our results further 
illustrate the benefits of averaging over multiple samples for LDA parameter estimation. 
The contributions of our work can be summarized as follows: 


• We propose a novel, theoretically motivated method for improved estimation of the 
topic-level and document-level parameters of an LDA model based on collapsed Gibbs 
samples. 
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We present an extensive empirical comparison of our proposed estimators against those 


of standard CGS as well as the CVBO algorithm (Asuncion et al., 2009) —a variant of 


the Collapsed Variational Bayesian inference (CVB) method—in both unsupervised 
and supervised learning settings, demonstrating the benefits of our approach. 


We provide an additional experimental comparison of the CGS and CVBO algorithms 
regarding their convergence behavior, to further contextualize our empirical results. 


Our theoretical and experimental results lead to three primary conclusions of interest to 
topic modeling practitioners: 

• Averaging over multiple CGS samples to construct a point estimate for LDA parameters 
is beneficial to performance, in both the unsupervised and supervised settings, in cases 
where identifiability issues can be safely resolved. 

• Using GVBO-style soft clustering to construct these point estimates is both valid and 
useful in an MCMC setting, and corresponds to implicitly averaging over many samples, 
thereby improving performance with little extra computational cost. This ports some 
of the benefits of CVBO’s use of dense uncertainty information to CGS, while retaining 
the sparsity advantages of that algorithm during training. 

• Increasing the number of topics increases the benefit of our soft clustering/averaging 
approach over traditional CGS. While CVBO outperforms CGS with few topics and 
a moderate number of iterations, CGS-based methods otherwise outperform CVBO, 
especially when using our improved estimators. 


As this paper deals with LDA in both the unsupervised and the supervised setting, we 
will treat the following terms as equivalent when describing the procedure of learning an 
LDA model from training data: ‘estimation’, ‘fitting’, and ‘training’. Since we are in a 
Bayesian setting the word ‘inference’ will refer to the recovery of parameters, which are 
understood to be random variables, and/or the latent variables, and to the computation of 
the posterior distributions of parameters and latent variables. The term ‘prediction’ will be 
applied to the case of predicting on test data (i.e. data that was unobserved during training). 
Additionally, since Prior-LDA is essentially a special case of unsupervised LDA, we will 
focus our algorithm descriptions on the case of standard unsupervised LDA and specify the 
exceptions as they apply to Prior-LDA. Furthermore, we note here that we will be using the 
CGS algorithm described by Griffiths and Steyvers (2004), which approximates the LDA 
parameters of interest through iterative sampling in a Markov chain Monte Carlo (MCMC) 
procedure, unless otherwise specified. 

The rest of this paper is organized as follows: Section describes backround information 
on the LDA model, Bayesian inference, and the CGS and the GVBO inference methods, as 
well as the Prior-LDA model, providing an overview of the related literature. Sections 
and 1^ introduce our estimators for the relevant LDA parameters. Sections and describe 
our experiments comparing the estimators of our method to those of CGS and CVBO in 
unsupervised and supervised learning settings, respectively. Finally, Section summarizes 
the main contributions and conclusions of our work. 
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2. Background and Related Work 

In this section we provide the background necessary to describe our methods, and also 
establish the relevant notation. 

2.1 Latent Dirichlet Allocation 

LDA is a hierarchical Bayesian model that describes how a corpus of documents is generated 
via a set of unobserved topics. Each topic is parameterized by a discrete probability 
distribution over word types, which captures some coherent semantic theme within the 
corpu^ Here, we assume that this set of word types is the same for all topics (i.e. we 
have a single, shared vocabulary). Furthermore, LDA assumes that each document can 
be described as a discrete distribution over these topics. The process for generating each 
document then involves first sampling a topic from the document’s distribution over topics, 
and then sampling a word token from the corresponding topic’s distribution over word types. 

Formally, let us denote as V the size of the vocabulary (the set of unique word types) 
and as DTrain = \T^Train\ and Dxest = \T^Test\ the number of training and testing documents 
respectively. We will use K to denote the number of topics. Similarly, v denotes a word type, 
Wi a word token, k a topic (or in the case of Prior-LDA, a label), and zt a topic assignment. 
The parameter (f)kv, k £ {1...K}, v £ {1...V} represents the probability of word type v under 
topic (or label) k, and 6dk represents the probability of topic (or label) k for document d. 
Additionally will denote the parameter of the Dirichlet prior on 9 for topic k and the 
parameter of the Dirichlet prior on cj) for word type v. Unless otherwise noted we assume for 
simplicity that the two priors are symmetric, so all the elements of the parameter vector a 
will have the same value (the same holds for the parameter vector /3). The term Nd denotes 
the number of word tokens in a given document d. Lastly, in the multi-label setting, JCd 
stands for the set of labels that are associated with d. This notation is summarized in Table 

El 

LDA assumes that a given corpus of documents has been generated as follows: 

• For each topic k £ {1.. .K}, sample a discrete distribution (pk over the word types of 
the corpus from a Dirichlet(/3). 

• For each document d, 

— Sample a discrete distribution 9d over topics from a Dirichlet(Q:). 

— Sample a document length Nd from a Poisson(^), with ^ £ ]R>o 
— For each word token Wi in document d, i £ {l...Ad}: 

* Sample a topic Zi = k from discrete(0d)- 

* Sample a word type Wi = v from discrete((/)^J. 

The goal of inference is to estimate the aforementioned 6 and cj) parameters—the discrete 
distributions for documents over topics, and topics over word types, respectively. In doing 
so, we learn a lower-dimensional representation of the structure of all documents in the 
corpus in terms of the topics. In the unsupervised learning context, this lower-dimensional 

1. A discrete distribution a multinomial distribution for which a single value is drawn, i.e. where N — 1. 
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V the size of the vocabulary 

K the number of topics (or labels) 

DTrain the number of training documents (similarly DTest) 

Terrain the Set of training documents (similarly VTest) 
d a document 

V a word type 

Wi a single word token, i.e. an instance of a word type 
Wrf the vector of word assignments in document d 
k a topic (label) 

Zi the topic assignment to a word token Wi 

Zd the vector of topic assignments to all word tokens of a document 

Zdjk binary indicator variable, equals 1 IFF the jth word in document d 

is assigned to topic k 
Nd number of word tokens in d 
JCd set of topics (or labels) in d 

Ukv number of times that word type v has been assigned to topic k across 
the corpus 

ndk number of word tokens in d that have been assigned to topic k 
(j)k word type distribution for topic k 

6d topic distribution for document d 

4>kv topic-word type probability 

6dk document-topic probability 

ak Dirichlet prior hyperparameter on 9 for topic k 

Py Dirichlet prior hyperparameter on (j) for word type v 

Idik CVBO variational probability of word token Wi in document d being 
assigned to topic k 


Table 2: Notation used throughout the article. 


representation of documents is useful for both summarizing documents and for generating 
predictions about future, unobserved documents. In the supervised, multi-label learning 
context, extensions of the basic LDA model—such as Prior-LDA —put topics into one-to-one 
correspondence with labels, and the model is used for assigning labels to test documents. 


2.2 Bayesian Inference, Prediction, and Parameter Estimation 

LDA models are typically trained using Bayesian inference techniques. Given the observed 
training documents T>Train, the goal of Bayesian inference is to “invert” the assumed genera¬ 
tive process of the model, going “backwards” from the data to recover the parameters, by 
inferring the posterior distribution over them. In the case of LDA, the posterior distribution 
over parameters and latent variables is given by Bayes’ rule as 


p{(j),e,z\T)Train,OL,l3) 


PiT^TrainH, z)p((/), 6*, z|q:, /3) 
piT^Train\(^i / 3 ) 


6 



Improved Gibbs Sampling Parameter Estimators for LDA 


Being able to make use of the uncertainty encoded in the posterior distribution is a key 
benefit of the Bayesian approach. In the ideal procedure, having computed the posterior it 
can be used to make predictions on held-out data via the posterior predictive distribution, 
which computes the probability of new data by averaging over the parameters, weighted by 
their posterior probabilities. For a test document under the LDA model we have 

p(w(-^-)|DTr-a*n,a,/3) = J j 


Similarly, for supervised variants of LDA (discussed below), the ideal procedure is to 
average over all possible parameter values when predicting labels of new documents. In 
practice, however, it is intractable to compute or marginalize over the posterior, necessitating 
approximate inference techniques such as MCMC methods, which sample from the posterior 
using an appropriate Markov chain, and hence approximate it with the resulting set of 
samples. In particular, Nguyen et al. (2014) advocate using multiple samples to approximate 
the posterior, and averaging over the corresponding approximation to the posterior predictive 
distribution in order to make predictions for new data. 

However, in the LDA literature it is much more common practice to simply approximate 
the posterior (and hence posterior predictive) distribution based on a single estimated value 
of the parameters (a point estimate). Although not ideal from a Bayesian perspective, a 
point estimate (p, 9 of the parameters is convenient to work with as it is much easier for a 
human analyst to interpret, and is computationally much cheaper to use at test time when 
making predictions than using a collection of samples. In this paper, while strongly agreeing 
with the posterior predictive-based approach of Nguyen et al. (2014) in principle, we take 
the perspective that a point estimate is in many cases useful and desirable to have. 

Following Nguyen et al. (2014), we do, however, have substantial reservations regarding 
the ubiquitous standard practice of using a single MCMC sample to obtain a point estimate. 
By interpreting MCMC as a “stochastic mode-finding algorithm” we can view this procedure 
as a poor-man’s approximation to the mode of the posterior. For many models, with a 
sufficiently large amount of data the posterior will approach a multivariate Gaussian under 
the Bernstein-von Mises theorem, and eventually become concentrated at a single point 
under certain general regularity conditions, cf. (]Kass et al. 1990, Kleijn and van der Vaart 


2012). In this regime, a point estimate based on a single sample will in fact suffice. More 


generally, however, due to posterior uncertainty this procedure is unstable and consequently 
statistically inefficient: a parameter estimate from a posterior sample has an asymptotic 
relative efficiency (ARE) of 2, meaning, roughly speaking, that the variance of the estimator 
requires twice as many observations to match the variance of the ideal estimator, in the 
asymptotic Gaussian regime, under general regularity conditions, cf. (Wang et ah, 2015[). 


As a compromise between the expensive full Bayesian posterior estimation procedure 
and noisy “stochastic mode-finding” with one sample, in this paper we propose to use the 
posterior mean, approximated via multiple MCMC samples, as an estimator of choice in 
many cases. The posterior mean, if it can be precisely computed, is asymptotically efficient 
(i.e. it has an ARE of 1), under the conditions necessary for the Bernstein-von Mises theorem 
to hold, cf. (Kleijn and van der Vaart, 2012). Although we cannot compute the posterior 
mean exactly, and the conditions necessary for the Bernstein-von Mises theorem have not 
been rigorously verified for LDA, these results provide a useful intuition for why we should 
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Algorithm 1 Pseudocode for our proposed estimators. For simplicity, we consider only the 
case where a single CGS sample z is used. 


function 9^ 

Inpnt: Burned-in CGS sample z with associated count matrices, documents T> 
for d = 1 to D do 

for j = 1 to Nd do 

V := Wdj 

for /c = 1 to A do 

Pdjk ■— ~ 

«' = 1 

end for 

Pdj,: ■= Pdj,: •/ SUm(pdi,:) 

end for 

■= ^d,: •/ SUm(0P^) 

end for 
return 9^ 
end fnnction 


Retrieve the word token from the word position j in d. 
{ndk^dj + a k) > During testing, the first term is replaced by 

Normalize such that Pdjk — •/ denotes elementwise division. 

O Normalize 9 ^ . such that 0 ^^ = 1 


function (ji^ 

Inpnt: Burned-in CGS sample z with associated count matrices, documents VTrain 
for /c = 1 to A do 



end for 

for d = 1 to A do 
for j = 1 to Nd do 

V := Wdj 

for A: = 1 to A do 

Pdjk = V • {udk^dj + «fc) 

E («fc„'^dj+du) 

v' = l 

end for 

Pdj,\ Pdj,: •/ SUm(P(^j ,) [> Normalize such that Pjjjfc = 1. 

$f,v ■— ^,v ~\~ Pdj,: 

end for 
end for 

for A = 1 to A do 

:= : ./sum(0^ .) > Normalize , such that 1 

end for 
return (pP 
end fnnction 
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prefer this approach over an estimator based on a single sample. We expect a sample-based 
approximation to the posterior mean to be much more stable than an estimator based on 
a single sample, and our empirical results below will show that this is indeed the case in 
practice. In our proposed algorithm, we will show how to effectively average over multiple 
samples almost “for free”, based on samples and their Gibbs sampling transition probabilities. 
This is valuable in the very common setting where we are able to afford a moderate number 
of burn-in MCMC iterations, but we are not able to afford the much larger number of 
iterations to compute enough samples required to accurately estimate the posterior mean of 
the parameters. 


It should be noted that averaging over MCMC samples for LDA, without care, can be 
invalidated due to a lack of identifiability. Multiple samples can each potentially encode 
different permutations of the topics, thereby destroying any correspondences between them, 
and Griffiths and Steyvers (2004) strongly warn against this. However, in this work, we 
will identify certain cases for which it is safe, and indeed highly beneficial, to average over 
samples in order to construct a point estimate, such as when estimating 9 at test time with 
the topics held fixed. 


2.3 Collapsed Gibbs Sampling for LDA 

Gibbs Sampling (CGS) algorithm for LDA, introduced by 
, marginalizes out the parameters (p and 6, and operates only 
variable assignments z. This leads to a simple but effective MCMC algorithm which mixes 
much more quickly than the naive Gibbs sampling algorithm. To compute the CGS updates 
efficiently, the algorithm maintains and makes use of several count matrices during sampling. 
We will employ the following notation for these count matrices; n^v represents the number 
of times that word type v is assigned to topic k across the corpus, and ridk represents the 
number of word tokens in document d that have been assigned to topic k. The notation for 
these count matrices is included in Table [2j 

During sampling, CGS updates the hard assignment Zi of a word token Wi to one of 
the topics k G {1...A}. This update is performed sequentially for all word tokens in the 
corpus, for a fixed number of iterations. The update equation giving the probability of 
setting Zi to topic k, conditional on Wi, d, the hyperparameters o: and (3, and the current 
topic assignments of all other word tokens (represented by •) is: 


The Collapsed 


Steyvers (2004 


Griffiths and 
on the latent 


p{zi = k\wi 


V, d, a, f3, ■) oc 




V 

ij^kv'^i T 

v'=l 


K 

Nd+ ^ Oik' 
k'=l 


( 1 ) 


In the above equation, one excludes from all count matrices the current topic assignment of 
rcj, as indicated by the notation. A common practice is to run the Gibbs sampler for a 
number of iterations before retrieving estimates for the parameters, as the Markov chain is 
typically in a low probability state initially and so early estimates are expected to be of poor 
quality. After this burn-in period, we can retrieve estimates for both the topic assignments 
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Zi as well as for the 9 and (j) parameters. Samples are taken at a regular interval, often called 
sampling lag^ 

To compute point estimates for 6 and 0, Rao-Blackwell estimators were employed in 
(Griffiths and Steyvers, 2004). Specifically, a point estimate of the probability of word type 

^kv T Pv 


V given topic k is computed as: 


4^kv ■“ 


V 


( 2 ) 


ijlkv' T Pv'') 

v'=l 


Similarly, a point estimate for the probability of the topic k given document d is given by: 

ndk + oik 


^dk 


K 

Nd+ Oik' 
k'=l 


( 3 ) 


During prediction (i.e. when applying CGS on documents that were unobserved during 
training), a common practice is to fix the (p distributions and set them to the ones learned 
during estimation. The sampling update equation presented in Equation thus becomes: 

p{zi = k\wi = V, d, OL, p, •) oc pkv ■ ^ ■ (4) 

Nd+ Y “fc' 

k'=l 


2.4 Labeled LDA 


An extension to unsupervised LDA for supervised multi-label document classification was 


proposed by Ramage et al. (2009). Their algorithm. Labeled LDA (LLDA), employs a 
one-to-one correspondence between topics and labels. During estimation of the model, the 
possible assignments for a word token to a topic are constrained to the training document’s 
observed labels. Therefore, during training, the sampling update of Equation becomes: 


p{zi = k\wi = V, d, a, f3, ■) (X < 




E ^d+ E «fe' 

v^=l k' = l 


0 , 


if A; G /Crf 


otherwise. 


( 5 ) 


Inference on test documents is performed similarly to standard LDA; estimates of the 
label-word types distributions, p, are learned on the training data and are fixed, and then 
the test documents’ 9 distributions are estimated using Equation Unlike in unsupervised 
LDA, where topics can change from iteration to iteration (the label switching problem), in 
LLDA topics remain steady (anchored to a label) and therefore it is possible to average 
point estimates of p and 9 over multiple Markov chains, thereby improving performance. 


Ramage et al. (2011) have introduced PLDA to relax the constraints of LLDA and exploit 


the unsupervised and supervised forms of LDA simultaneously. Their algorithm attempts to 
model hidden topics within each label, as well as unlabeled, corpus-wide latent topics. 


2. Averaging over dependent samples does not introduce any bias. Using a sampling lag (a.k.a. “thinning”) 
is merely convenient for reducing memory requirements and computation at test time. 
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Rubin et al. (2012) presented two extensions of the LLDA model. The first one, Prior- 


LDA, takes into account the label frequencies in the corpus via an informative Dirichlet 
prior over parameter 6 (we describe this process in detail in Section 6.3). The second 


extension, Dependency LDA, takes into account label dependencies by following a two 
stage approach: first, an unsupervised LDA model is trained on the observed labels. The 
estimated 9' parameters incorporate information about the label dependencies. Second, 
the LLDA model is trained as described in the previous paragraph. During prediction, the 
previously estimated 6' parameters of the unsupervised LDA model are used to calculate an 
asymmetrical hyperparameter which is in turn used to compute the 6 parameters of 
the LLDA model. 


2.5 Collapsed Variational Bayes with Zero Order Information — CVBO 

Collapsed Variational Bayesian (CVB) inference (Teh et al. 2006| ) is a deterministic inference 
technique for LDA. The key idea is to improve on Variational Bayes (VB) by marginalizing out 
9 and (/> as in a collapsed Gibbs sampler. The key parameters in CVB are the variational 
distributions over the topics, with representing the probability of the assignment of 
topic k to the word token Wi in document d, under the variational distribution. As the 
exact implementation of CVB is computationally expensive, the authors propose a second- 
order Taylor expansion as an approximation to compute the parameters of interest, which 
nevertheless improves over VB inference with respect to prediction performance. 

Asuncion et al.| (|2009|) presented a further approximation for CVB, by using a zero order 


Taylor expansion approximation]^ The update equation for CVBO is given by: 


'^dik OC 


E Kv'^i + 

v'=l 


+ Oik) 


( 6 ) 


D 


with = E Er.w,,=v^djk, n'^k = T.^ldjk- 


d=i 


■IJ-Wdj 


Point estimates for 9 and (j) are retrieved from the same equations as in CCS (Equations 
j^andj^. Even though the above update equation looks very similar to the one for CCS, 
the two algorithms have a couple of key differences. First, the counts and for CVBO 
differ from the respective ones for CCS; the former are summations over the variational 
probabilities while the latter sum over the topic assignments Zi of topics to words. Second, 
CVBO differs from CCS in that in every pass, instead of probabilistically sampling a hard 
topic-assignment for every token based on the sampling distribution in Equation it keeps 
(and updates) that probability distribution for every word token. A consequence of this 
procedure is that CVBO is a deterministic algorithm, whereas CCS is a stochastic algorithm 


(Asuncion, 2010). The deterministic nature of CVBO allows it to converge faster than other 


inference techniques. 

On the other hand, the CVBO algorithm has significantly larger memory requirements as 
it needs to store the variational distribution 7 for every token in the corpus. More recently. 


a stochastic extension of CVBO, SCVBO, was presented by Foulds et al. (2013), inspired by 


3. The first-order terms are zero, so this is actually equivalent to a first-order Taylor approximation. 
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the Stochastic Variational Bayes (SVB) algorithm of Hoffman et ah (2010a). Both SCVBO 
and SVB focus on efficient and fast inference of the LDA parameters in massive-scale data 
scenarios. SCVBO also improves the memory requirements of CVBO. Since CVBO maintains 
full, dense probability distributions for each word token, it is unable to leverage sparsity to 


accelerate the algorithm, unlike CCS (Yao et ah, 2009 Li et ah, 2014; Yuan et ah, 2015 


Chen et ah, 2016). Inspired by CVBO, in our work we aim to leverage the full distributional 


uncertainty information per word token, in the context of the parameter estimation step of 
the CCS algorithm, while maintaining the valuable sparsity properties of that algorithm 
during the training process. 


3. CGSp for Document-Topic Parameter Estimation 

In this section we present our new estimator for the document-topic {Od) parameters of 
LDA that make use of the full distributions of word tokens over topics. For simplicity, we 
first describe the case of 9d estimation during prediction and then extend our theory to 9d 
estimation during training. We present our estimator for the topic-word parameters {(p^) in 
the following section. 


3.1 Standard CGS 9d Estimator 


The standard estimator for document d’s parameters 9d from a collapsed Gibbs sample Zd 
(Equation]^, due to Griffiths and Steyvers (2004), corresponds to the posterior predictive 
distribution over topic assignments, i.e. 


= k\zd, a) 

(7) 

j = k,9d\zd,a) d9d 

(8) 

j = k\9d)p{9d\'Z‘d,OL) d9d 

(9) 

j 9dkP{9d\'Z'd,a) d9d 

(10) 

'^pi9d\zd,C()[ddk] 

(11) 

®'Dirichlet(ed|a-hnd) 

(12) 

^dk 

(13) 
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where the last two lines follow from Dirichlet/multinomial conjugacy, and from the mean 
of a Dirichlet distribution, respectively. Here, corresponds to a hypothetical “new” 

word in the document. The posterior predictive distribution can be understood via the urn 
process interpretation of collapsed Dirichlet-multinomial models. After marginalizing out 
the parameters 0 of a multinomial distribution x ~ Multinomial(0, N) with a Dirichlet prior 
9 ~ Dirichlet(Q;), we arrive at an urn process called the Dirichlet-multinomial distribution. 


a.k.a. the multivariate Polya urn, cf. (Minka, 2000). The urn process is as follows: 


Begin with an empty urn 
For each k, 1 < k < K 

— add Ofc balls of color k to the urn 


For each i, 1 < i < N 


— Reach into the urn and draw a ball uniformly at random 
— Observe its color, k. Count it, i.e. add one to Xk 

— Place the ball back in the urn, along with a new ball of the same color. 

We can interpret the posterior predictive distribution above as the probability of the next 
ball in the urn, i.e. the next word in the document, if we were to add one more word, by 
reaching into the urn once more. 


3.2 A Marginalized Estimation Principle 


The topic assignment vector for test document d is a latent variable which typically has 
quite substantial posterior uncertainty associated with it, and yet the most common practice 
is to ignore this uncertainty and construct a point estimate of 6^ from a single z^ sample. 


which can be detrimental to performance (Nguyen et ah, 2014). Following Griffiths and 


Steyvers (2004), we assume that the predictive probability of a new word’s topic assignment 
^(new) ^ principled estimate of Od- Differently to previous work, though, we take the 
perspective that the latent Zd is a nuisance variable which should be marginalized out. This 
leads to a marginalized version of Griffiths and Steyvers’ estimator. 


Odk = = k\wd, (k, a) = = k, z^jw^, cf, a) 


Zd 


(14) 


Due to its treatment of uncertainty in Zd, we advocate for 9d as a gold standard principle for 
constructing a point estimate of 9d. We will introduce computationally efficient Monte Carlo 
algorithms for approximating it below. Note that while previous works have considered 
averaging strategies which correspond to Monte Carlo approximations of this principle, 
including iGriffiths and Steyversl (|2004|) , (Asuncion et al.l (|2009l) , iNguyen et al.l (120141) , and 


references therein, discussion on the estimation principle of Equation that these methods 


approximate appears to be missing in the literature. Griffiths and Steyvers (2004) mention 


that such averaging strategies are possible but caution against them due to concerns regarding 
identifiability, however this identifiability issue does not arise at test time with topics (j) held 


fixed, as in the scenario we consider here. Nguyen et al. (2014) empirically explore averaging 
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procedures for making predictions by approximating the posterior predictive distribution. 
They do not aim to compute a point estimate from multiple samples, however their Single 
Average prediction strategy can be seen to be equivalent to using a naive Monte Carlo 
estimate (given in Equation [2^ of Equation 14 in the context of predicting held-out words. 


3.2.1 Interpretations 


We can interpret the estimator from Equation 14 in several other ways. 


Expected Griffiths and Steyvers estimator: The estimator 9d can readily be seen to 
be the posterior mean of the Griffiths and Steyvers estimator Od- 


Odk = = k\wd, (j), a) = = k, z^jw^, </>, a) (15) 

Zd 

= = k\zd, a)p{zd\wd, (j), a) (16) 

Zd 

= ^p(zd|wd,</.,«) = k\zd, a)] (17) 

~ ^p{za\'w^,4i,ci.)[^dk] ■ (18) 

Posterior mean of 9d'. Griffiths and Steyvers’ estimator can be viewed as the posterior 
mean of 9d, given Zd (Equation |ll[). Along these lines, we can interpret 9d as the marginal 
posterior mean: 

Odk = = k\wd, 4>, Ol) = [Odk] (19) 

~ -®P(zd|wd,0,c«) ^piea\za,a)[(^dk] (20) 

~ ^p{9d\'^d,<t>,a-)[^dk] ) ( 21 ) 

where the last line follows by the law of total expectation. 


Urn model interpretation: To interpret the estimator from an urn model perspective, 
we can plug in Equation 

Odk = = k\wd, 4>, a) = = k\zd, a)p{zd\wd, (p, a) (22) 

Zd 

= X] Ar - p{zd\wd,(p,a) . (23) 

z, Nd + }2k'=i(^k' 


We can view this as a mixture of urn models. It can be simulated by picking an urn with 
colored balls in it corresponding to the count vector n^ + ci, with probability according to the 
posterior p{zd\'Wd, (pi ct), then selecting a colored ball, corresponding to a topic assignment, 
from this urn uniformly at random. Alternatively, we can model this equation with a single 
urn, with (fractional) colored balls placed in it according to the count vector, 


oc ^(nrffc+ aA:)p(zd|wrf, (/), a) = ^p(zrf|wrf,(?i, Q;)nrffc+ afc , (24) 

Zd Zd 

and with the next ball (i.e. topic) selected according to the total number of balls of 

each color (i.e. topic) in the urn. 
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3.2.2 Monte Carlo algorithms for computing the estimator 

Due to the intractable sum over z^, we cannot compute Od exactly. A straightforward Monte 
Carlo estimate of 6d is given by 




'^dk Oik 


^^d + Oik' ^ 


‘■dk 


^ i=l + Ylk'=l “fc' 


(25) 

(26) 


where each of the count variables corresponds to a sample 


p{2d\^d,4>,oi). Algo¬ 
rithmically, this corresponds to drawing multiple MCMC samples of Zd and averaging the 
Griffiths and Steyvers estimator 9d over them. Alternatively, by linearity of expectation we 
can shift the expectation inwards to rewrite the above as 


E, 


'p(zd|wd,(/),a) 


^dk “h Olj^ 


+ X)a/=i 


= E,^ 


'p(zd|wd, 0 ,Q:) 


r Efci ^djk + oik 1 


■ A^d + ■ 

^p{za\wa,(f>,a)[zdjk] + Oik 

E'd + Z)^=l Oik' 




(27) 

(28) 


This leads to another Monte Carlo estimate. 


y-Nd I CO I ^ 

Z^j=l s l^i=l ^djk ^ Oik 


(i) 


7dk 


Ed + oik' 


(29) 


In this case we would once again draw multiple MCMC samples of z^, but average over 
the samples to compute the proportion of times that each word is assigned to each topic, 
applying Griffiths and Steyvers’ estimator to the resulting expected total counts. It can 
straightforwardly be shown that this estimator is mathematically equivalent to the estimator 


of Equation 26, so this algorithm will perform identically to that method. We mention this 
formulation for expository purposes, as it will be used as a launching point for deriving our 
proposed GGSp algorithm. 

3.3 CGSp for 9d 

We aim to design a statistically and computationally efficient Monte Garlo algorithm which 
improves on the above algorithms. To do this, we first consider a hypothetical Monte Garlo 


algorithm which is more expensive but easily understood, using Equation 29 as a starting 
point. We then propose our algorithm, GGSp, as a more computationally efficient version 
of this hypothetical method. Let z^*^ ~ p(zrf|wrf, (j), oc), i G {1,..., 5} be 5 samples from 
the posterior for the topic assignments, and let 1 G {1,..., L} be L Gibbs samples 

starting from zy, where a Gibbs update to the jth topic assignment is performed. The 
Bayesian posterior distribution is a stationary distribution of the Gibbs sampler, so these 


Gibbs updated samples are still samples from the posterior, i.e. 


-d 


p(zrf|wd,(/>, q). 
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We can therefore estimate posterior expectations of any quantity of interest using these 

(i) 

samples instead of the z^’s, and in particular: 


^dk 


2^j=i 


1 

SL l^i=l 2^1=1 '^djk 


^d + ^k' 


+ Otk 


(30) 


This method increases the sample size, and also the effective sample size, relative to the 
method of Equation |29[ but also greatly increases the computational effort to obtain 
the Monte Carlo estimator, making it relatively impractical. Fortunately, a variant of 
this procedure exists which is both computationally faster and has lower variance. Let 
(j)) be the transition operator for a Gibbs update on the jth topic assignment. 
Holding the other assignments fixed, we have the partially collapsed Gibbs update 


= il 


.(i) 


, (j)) OC cjik 


n 


(d 

dk- 


+ Oik 




N. 


d^j 


+ Z^fc'=i 


(31) 


By the strong law of large numbers. 


1 

2_^ ^djk 
1=1 


^T(zW- 






(d-s-j 

djk 


= T{z 


(d-s-f 

djk 



(/>) as L —>■ oo with probability 1. 


(32) 


So for the cost of a single Gibbs sweep through the document, we can obtain an effective 
number of inner loop samples L = oo by plugging in the Gibbs sampling probabilities as the 
expected topic counts in Equation This leads to our proposed estimator for 6dk ■ 


oP A ^j=^ 
^dk — 


q 

s 


Z^i=l^\^djk \^d 
^d + ^k' 


4>) + Ok 


(33) 


This method is equivalent to averaging over an infinite number of Gibbs samples adjacent to 
our initial samples f = 1,..., S' in order to estimate each word’s topic assignment probabilities, 
and thereby estimate the expected topic counts for the document, which correspond to our 
estimate of O^- By the Rao-Blackwell theorem, it has lower variance than the method in 
Equation 30 with finite L. We refer to this Monte Carlo procedure as CGSp, for collapsed 
Gibbs sampling with probabilities maintained. 

The above argument makes it clear that 6^ is an unbiased Monte Carlo estimator of the 
posterior predictive distribution, but for added rigor we can also show this directly: 
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01 


i=\ 

(34) 

s 


X] ^^djk ] + “fc 

i=l 

(35) 

p(zd|wd,fli,Q!) ^T{x^^\zd,<i>)^^djk\ + “fc 

(36) 


Nd 


Nd 


Nd 


J =1 
Nd 

i=i 

Nd 

~ ^p{zd\-Wd,<l>,cx) [zdjk] + Oik 

J =1 

Nd 

— -®p(zd|wd,0,a) ^djk] + Oik 

i=i 

'^p(Z(i|wd,0,Ql) “ 1 “ Clfc] 

r + “fc 

OC £/, 


■^p(z^ ^|wd,0,Qi) L djk 


zli] + ak 


'p(zd|wd,i;i,ci) I 


+ Z]^=l “fe' 


^pf^Z^new) ^ fc|wrf,^,Q;) . 


(by the law of total expectation) 
(by stationarity of the Gibbs sampler) 

(37) 

(38) 

(39) 

(by Equation 


We advocate the use of this procedure even when we can only afford a single burned-in sample, 
i.e. 5 = 1, as a cheap approximation to = A:|wrf, cj), a.) which can be used as a plug-in 

replacement to Griffiths and Steyvers (2004)’s single sample estimator = A:|Z(i,Q). 

Indeed, the increased stability of our estimator, due to implicit averaging, is expected to be 
especially valuable when S' = 1. We emphasize three primary observations arising from the 
above analysis. 


(1) We have shown that it is valid and potentially beneficial to use “soft” probabilistic 
counts instead of hard assignment counts when computing parameter estimates from 
collapsed Gibbs samples. This ports the soft clustering methodology of GVBO to 
the CGS setting at estimation time. While our approach computes soft counts via 
Gibbs sampling transition probabilities and GVBO uses variational distributions, the 
corresponding equations to compute them are very similar, cf. Equations and 

(2) Averaging these (hard or soft) count matrices over multiple samples can be performed 
to improve the stability of the estimators, as long as identifiability is resolved, e.g. at 
test time with the topics held fixed. 

(3) Averaging over either hard or soft count matrices will converge to the same solution in 
the long run: the marginalized version of the Griffiths and Steyvers estimator, given 
by Equation [T^ However, we expect the use of soft counts to be much more efficient 
in the number of samples S', as it implicitly averages over larger numbers of samples. 
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This last point is crucial in many common practical applications of topic modeling where 
we have a limited computational budget per document at prediction time, and can afford 
only a modest burn-in period, and few samples after burn-in. This situation frequently 
arises when evaluating topic models, for which a novel approach is typically compared to 
multiple baseline methods on a test set containing up to tens of thousands of documents, 
with the entire experiment being repeated across multiple corpora and/or training folds. 
The computational challenge is greatly exacerbated when evaluating topic model training 
algorithms, for which the above must be repeated at many stages of the training process 
(Foulds and Smyth, 2014). For instance, Asuncion et al. (2009) state that “in our experiments 
we don’t perform averaging over samples for CGS ... [in part] for computational reasons.” 
Computational efficiency at test time is also essential when using topic modeling for multi¬ 


label document classification (Ramage et al. 


2009 

2011 

Rubin et al. 

2012 


deployed system may need to perform inference on a never-ending stream of incoming test 
documents. 

Pseudocode illustrating the proposed method is provided in Algorithm . It should be 
noted than any existing CGS implementation can very straightforwardly be modified to 


and Narayanamurthy 

2010), and WarpLDA ( 

Chen et al. 

2016 


Our method is compatible 
with modern sparse and parallel implementations of CGS, as the final estimation procedure 
does not need to alter the training process. This can lead to a “best of both worlds” situation, 
where sparsity is leveraged during the expensive GGS inference process, but the full dense 
distributions are used during the relatively inexpensive parameter recovery step, which in 
many topic modeling applications is performed only once. In this setting, the improved 
performance is typically worth the computational overhead (which is similar to that of a 
single dense CGS update), as we demonstrate in Section 5.8 A practical implementation of 
our estimators can further leverage the fact that the count matrices are not modified during 
the procedure, by computing the sampling distribution only once for each distinct word type 
in each document, and by performing the computations in parallel. In the following we will 
sometimes abuse notation and denote the procedure of using 9^ to estimate the O/s as 6^ 
(and similarly for the procedure to estimate (/, denoted cj^, introduced below). 


3.3.1 GGSp FOR ESTIMATING 6d FOR TRAINING DOGUMENTS 

During the training phase, we can apply the same procedure to estimate 9d for training 
documents based on a given collapsed Gibbs sample 


^ dk 




(40) 




The collapsed Gibbs sampler does not explicitly maintain an estimate of cf to plug into the 
above. However, the Griffiths and Steyvers estimator for cj) (Equation]^ can be obtained 
from the same count matrices that CGS computes, as it corresponds almost exactly to the 
first term on the right hand side of Equation Alternatively, below we propose a novel 
estimator for 0 which can also be used here with a little bit more computational effort. 

Similarly to the Griffiths and Steyvers estimator, it should be noted that averaging 
this estimator over multiple samples from different training iterations raises issues with the 
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identifiability of the topics, which can potentially be problematic (Griffiths and Steyvers 


2004). We do not consider this procedure further here. It is however safe to apply Equation 


33 to multiple MCMC samples for that are generated while holding cj) fixed, i.e. treating 


document d as a test document for the purposes of estimating 0^ given the cj) from 


4. CGSp for Topic - Word Types Parameter Estimation 

By applying the same approach as for 6^, plugging in Gibbs transition probabilities instead 
of indicator variables into Equation we arrive at an analogous technique for estimating 0: 



E ^Train \ ' 

d=l /-jj-.Ulj j=V ^ 




'-'djk I 




Ed=r^ |z(d) + 


(41) 


where |z(*)) corresponds to the collapsed Gibbs update in Equation 


While the 


intuition of the above method is clear, there are several technical complications to an 
analogous derivation for it. Nevertheless, we show that the same reasoning essentially holds 
for (j), with minor approximations. In the following, we outline a justification for the c/f 
estimator, with emphasis on the places where the argument from 6^ does not apply, along 
with well-principled approximation steps to resolve these differences. 

First, consider the estimation principle underlying the technique. For topics (j), the 
standard Rao-Blackwellized estimator of Griffiths and Steyvers (2004), given by Equation]^ 
once again corresponds to the posterior predictive distribution, 


= J = k)dct)k 

^Dirichlet{iik+f3)[4^kv\ 

_ 'kl'kv 4 “ Pv 

~ ~V ’ 

(p'v'k 4“ /4ii') 

v'=l 


(42) 


where 'v^z=k is the collection of word tokens assigned to topic k. This estimator again 
conditions on the latent variable assignments z, which are uncertain quantities. We desire a 
marginalized estimator, summing out the latent variables z. However, without conditioning 
on z, the topic index k loses its meaning because the topics are not identifiable. The naive 
marginalized estimator 


4 = k) = 


(new) _ ^ 


= k) 


(43) 


sums over all permutations of the (collapsed representation of the) topics, and so conditioning 
on = k has no effect. The estimator in Equation 43 is therefore unfortunately not well 

defined. On the other hand, a sum over z’s is meaningful as long as the topics are aligned 
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for all values of z considered. Suppose, by some oracle, we know that z belongs to some 
identified subset of assignments Z, for which topic indices are aligned between all elements 
z' G Z. We are then able to properly define our idealized estimator with respect to Z, 


Ifj A ^ ^{new) ^ ^ z) = 




^{new) ^ ^{new) = ^ Z) . 


z,^Z 


(44) 

Regarding our proposed technique, the Monte Carlo estimator in Equation implicitly 
averages over the set of possible Gibbs samples Zcibbsi^^^^) that are adjacent to sample 
z^®), i.e. the candidate z assignments that differ from z^®^ in a single entry. A difference 
of a single word is extremely unlikely to cause label switching for a realistic corpus, so 
the topics will almost certainly be aligned for each of the adjacent Gibbs samples that 


the estimator implicitly averages over. The empirical results of Nguyen et al. (2014) also 


support this, as they found that averaging over multiple samples from the same MCMC 
chain was beneficial to performance, which suggests that identifiability was not a problem 

-dZaubsiZ^^)) 


as 


even for MCMC samples that are many iterations apart. Hence, we take 
our idealized estimator, under the supposition that Zcibbsi'^^'^'^) is identified, and we aim to 
derive Equationas a Monte Carlo algorithm that efficiently approximates it. 

To accomplish this, following the derivation for 6^, we would like to show that 


(1) we can use a Monte Carlo estimator in which the Zdj^s can be sampled individually, 
as in Equation and 


(2) the transition probabilities in Equation 41 can be used to implement this estimator 


with an effectively infinite number of adjacent Gibbs samples, as in Section 3.3 


While the previous stationarity argument for ([^ still applies, there is a further complication 
for (^l|. For 9^, linearity of expectation was sufficient to show in Equation 28 that 


E, 


'p{zd\'^d,<P,OL) 


' 'Zj=l ^djk A 1 _ Si = l -®p(zd|wd,0,Q:) i^djk] + Clfc 






^d + 


We cannot use this argument directly for cj^ because in this case the denominator of the 
Griffiths and Stevyers estimator, which is averaged over, also includes z terms which are 
involved in the expectation: 


EZcibbsiZ^)) 

^kv 




= E. 


'p(z|w,/3,zeZGii,(,s(z^®b) 


r-wd,j-- 


E ^Train „ I o \ 

d=l Ljj=l ^djk -r Z^v'=l W 


-Nd 


V 


(45) 


The issue here is that the adjacent Gibbs samples modify the overall counts per topic Uk = 
Yld=i‘^"" 'Zf=i ^djk- However, between adjacent Gibbs samples z and z(®\ the corresponding 

(i) 

topic counts nk are within ±1 of n),® and for a typical corpus we have that >> 1, so the 
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impact of a single word is negigible, i.e. 


^Train ^Train Derain ^d Dj'j-ain ^d 

iz iz E4i iz ^ E44 

d=l 3=1 d=l 3=1 d=l j=l d=l j=l 

^Train Nd 

for z E Zcibbsiz^"'') , E E Zdjk >> 1 . (46) 

d=i j=i 

In this case, we see from Equation that for a corpus in which >> 1, for all practical 
purposes the denominator in Equation is essentially constant over all adjacent Gibbs 
samples, and so 0 still holds approximately, and we have 


l,iZGibbs(Z^^)) _ 


kv 


= E. 


'p(z|w,/3,zeZGii,6s(z^*b) 


E. 


'p(z|w,/3,zeZGii)6s(z('b) 


-®p(z|w,/3,zeZGi6i,s(z(*b)l-^47’*:] E 

E ^Train Si) , y-E a 

d=i Z^j=i '^djk 2^v'=i yV 


EdJr“^" ^djk + A 


E;=i Zd^u + A.' J 

^j:wdj=v ^djk + f^v 

^D^ra in +ELlAr 


E -^Train 

d=l E/?'= 


j=l ^djk 


(47) 


A related argument was also made by Asuncion et al. (2009), who note that several topic 


modeling inference algorithms differ only by offsets of 1 or 0.5 to the counts, and state that 
“smce rifc is usually large, we do not expect [a small offset] to play a large role in learning^ 
To formalize this intuition, as we show in Appendix [B| the approximation can be bounded 
from below and from above as a function of as: 


nl 


+ E)/=1 Ad' 




+ Ed'=1 Ad' + 1 


nl 


A(z| w,^,zeZGi6i,s (z(*0) 


r-wd,j^ 


E ^Train y^JVd (i) I y^r o 

d=l 2^3=1 ^djk “T Z^d'=1 Pd' 




EdIr'^Er.u.d.=v^d3k + ff 


— ■®p(z|w,/3,zeZGi6i)s(zP0) 


J-Wd,j^ 


E ^Train y^I^^d -y,., I y^ ^ R 

d=l 2^3=1 “T Z^d'=1 Pd 

E ^Train \ ' ry I Q 

d=l 2^j-.Wd i=v ^djk -r Pd 


< 


nl 


+ Ep=l Ad 


n 


(0 


+ Ep=1 Ad' - 1 


-E. 


p(z\vf, 13 ,•z.&Z Gibbs (z^O) 


E ld'Train \^jyd AV I y^E a 
d=l 2^3=1 ^djk 2^v'=l Pd 


(48) 


for > 0. Since lim 


(i) 


"•fe^+EA=i /5„'+l 




= 1 and lim a) 'ff the left 

+E„'=i d„'-i 
ii) 

and right hand sides of Equation 47 will converge in the limit as the u^'’s go to infinity, e.g. 


when increasing the counts via a stream of incoming documents. In Appendix we provide 
a further empirical illustration of this approximation, by taking subsets of the documents in 
the corpus to vary Uk, and observing that as increases the difference between the left 


and right sides of Equation 47 approaches zero. 
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Finally, using the same stationarity and law of large numbers arguments as before, we 
obtain the (pP estimator by plugging in the transition operator, and then renormalizing. 


P^Train 

^kv ^ -^p(z|w,/3,zeZGii,f,s(z^'b) 

d=l j:'Waj=v 


^Train 

E E r( 4 P'|z«) + fl 


d=l j:wa,j=v 

(49) 


More formally, the full argument is given mathematically as 

d=l j:Wd,j=v 
P^Train 

. IZ, 

“d i^d 


(50) 

(51) 


d=l j:waj=v 
d^Train 

Y 1 Y 1 -®p(z|w,/3,zeZGii,i,s(z(0))[^47'*:] 

d=l j:waj=v 

(by stationarity of the Gibbs sampler) 


Z)d=l -^p(z|w,/3,zeZGii,6s(z^‘b)[^4?'fc] 

E ^Train „(d I a 

d=l l^j=l -^djk Z^v'=l 

Z)d=r'**" Ylj-.wai=v ^djk + Pv 


OC 


= E, 


'p(z|w,/3,zeZGii)6s(z(*b) 


E. 


'p(z|w,/3,zeZGii)i)s(z^“b) 


E ^Train Ai) I y^U o 

d=l Ejj=l ^djk Ejv'=1 

E ^Train \ '' ^ i O 

d=l l^j:wdj=v '^djk -r jJy 


Edlr^ Ei^i^djk + E:'=iP' 




V 


= p(^(--) = =k,zG Zcibbsiz^^^)) ■ 


(52) 

(53) 

(if Uk » 1) 

(54) 


In Algorithmic along with QP^s procedure, we also present the pseudocode for the (fP 
estimator. 


5. Unsupervised Learning Experiments — LDA 

This section describes the experiments that we performed to study the behavior of our 
proposed estimators in the context of unsupervised LDA models. We describe the data sets, 
the evaluation procedure and the experiments performed in the unsupervised LDA setting. 
Apart from the experiments reported here, we have performed two additional ones reported 
in the Appendix, (a) comparing the document-topic counts of the Collapsed Gibbs Sampling 
(CGS) algorithm against the soft document-topic counts, in order to study their differences 
as the algorithm converges and (b) providing an empirical validation of the approximation 
that we use in Equation |47| regarding our (jp estimator’s derivation. The relevant code of 
the following experiments (along with the code of the multi-label experiments of Section j^) 
can be found in https://github.com/ypapanik/cgs_p. 
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Data Set 

Train 

Drest 

Nd 

V 

BioASQ 

20,000 

10,000 

113.55 

135,186 

New York Times 

14,668 

7,000 

581.13 

19,259 

Reuters-21578 

11,475 

4,021 

49.68 

41,014 

TASA 

30,121 

7,529 

107.62 

21,970 


Table 3: Statistics for the data sets used in the unsupervised learning experiments. Average 
document length and number of word types refer to the training sets. 


5.1 Data Sets 

Four data sets were used in the unsupervised setting: a) BioASQ, b) New York Times, c) 
Reuters-21578 and d) TASA. The number of documents in the training and test sets, the 
average length of the training documents and the size of the vocabulary of word types found 
in the training set are presented in Table 


The first data set originates from the BioASQ challenge (Balikas et ah, 2014) that deals 


with large-scale online multi-label classification of biomedical journal articles. This learning 
task is particularly challenging as the taxonomy of labels includes around 27, 000 terms, 
with highly imbalanced frequencies. For the unsupervised experiments of this section, we 
used the 30,000 last documents of the BioASQ corpus of the year 2014, using the first 
20,000 as training documents and the rest as test documents. To construct the data set we 
concatenated the abstract and title of each article and removed common stopwords. The 
remaining unigrams were used as word types. 

The second data set contains articles published by the New York Times, manually 


annotated via their indexing service. We used the same data set as used by Rubin et al. 


(2012), with the same training set (14,668 documents) and keeping the first 7,000 documents 
for testing (out of the 15,989 of the original paper). 

The third data sel|^ contains documents from the Reuters news-wire and has been widely 
used among researchers for almost two decades. The split that we used has 11,475 documents 
for training and 4,021 documents for testing. We preprocessed the corpus by removing 
common stopwords. 

The TASA data set contains 37,650 documents of diverse educational materials (e.g., 


health, sciences, etc) collected by Touchstone Applied Science Associates (Landauer et al. 


1998). We used the first 30,121 documents as a training set and the remaining as a test set. 


The corpus already had stopwords and infrequent words removed, so we did not perform 
any further preprocessing. 

5.2 Evaluation 

Evaluation of LDA models typically focuses on the probability of a set of held-out documents 


given an already trained model (Wallach et ah, 2009). In this context, one must compute 


the model’s posterior predictive likelihood of all words in the test set, given estimates of the 
topic parameters cj) and the document-level mixture parameters 9. 


4. http://disi.unitn.it/moschitti/corpora/Reuters21578-Apte-115Cat.tar.gz 
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The log likelihood of a set of test documents DTest, given an already estimated model 
M, is given by Heinrich (2004) as: 


Drest DTest K 

Log Likelihood = logp(wd|M) = EE iog'^{4>kv ■ Odk) 

d=\ d=l i=l k=l 


with Wi = V. The perplexity will be: 


Log Likelihood 
Perplexity = exp(--- ) 

DTest 

E Nd 

d=l 


(55) 


(56) 


where lower values of perplexity signify a better model. 

Since 9d is unknown for test documents, and is intractable to marginalize over, a common 
practice in the literature in order to compute the above likelihood is to run the CGS algorithm 
for a few iterations on the first half of each document, and then to compute the perplexity of 
the held-out data (the second half of each document), based on the trained model’s posterior 
predictive distribution over words (Asuncion et ah, 2009). This is the approach we follow. 


5.3 Comparison between CGS and CGSp 

In this first experiment, the motivation is to validate the previously presented theory behind 
the (jf and 6^ methods. Specihcally, we are interested in verifying the following hypothesis: 
given a single burned-in sample, (pP or 6p will more effectively estimate the respective LDA 
parameters compared to the standard estimators, since the CGSp estimators make use of 
the full dense probabilities, through the infinite 1-steps of Eq. 33 This advantage would 


correspond to lower perplexity values for a single burned-in sample for the CGSp estimators 
compared to the standard GGS estimator. Furthermore, when averaging over multiple 
samples to estimate the 6 parameters during prediction, we expect that the 9p and standard 
9 estimators will eventually converge to the same solution given enough samples or Markov 
chains, since 9p aims to compute the posterior mean. However, we expect that 9p will 
converge to the posterior mean more rapidly in the number of samples that are averaged 
over than the standard estimator 9, with correspondingly faster improvement in perplexity. 

For this experiment we used all four data sets and considered four different topic number 
configurations (20, 50, 100, 500). For brevity, we report only the results for K = 100 and 
include the rest of the plots in Appendix During training, we ran one Markov chain 
for 200 iterations and took a single sample to calculate cp and (pP from the same chain. 
During prediction, since cp is fixed and topics are not exchanged through the Gibbs sampler’s 
iterations, we took multiple samples from multiple Markov chains, and averaged over these 
samples using the 9 (standard GGS) and 9P estimators. The burn-in period for the Gibbs 
sampler was set to 50 iterations and a sampling lag of 5 iterations was used. The a and 
(3 hyperparameters were symmetrical across all topics and words respectively and set such 
that ctfc = 0.1 and (3^ = 0.01. Finally, to ensure fairness between the 9 and 9p estimators, we 
used the same samples from the same chains to compute the respective estimates. 

In Figures and we show the perplexity scores against the number of samples that 
are averaged over, after burn-in, for one and five Markov chains respectively. All four 
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combinations of the estimators are depicted. By observing the results we notice some key 
points across all plots. First, if we consider the case where only one burned-in sample is 
used, corresponding to the left-most points in the plots, we can notice that both 9^ methods 
{(j) + 0^ and cjp + QP) decisively outperform the other two methods. This observation is 
particularly important for scenarios where we can’t afford to average over many samples due 
to computational or time limitations, as frequently occurs during the comparative evaluation 
of topic modeling methods, and in multi-label document classification. 

We also see that the (fP methods {(fP -I- 6 and (jP + 9^) have a short but steady advantage 
over the respective (p estimators. This advantage is not diminished for more samples or 
Markov chains, suggesting that cjP is actually a more accurate estimator than cp. We also 
remind here that in unsupervised scenarios we cannot average over samples to compute a 
(p estimate during training, since topics may be interchanged between iterations. A third 
remark, also aligned with our theoretical results, is that 9 and 9^ converge to the same 
solution after a sufficient number of samples are averaged over, although 9^ converges much 
more rapidly. Overall, these findings confirm that indeed both 9^ and cjP constitute improved 
estimators compared to the respective standard ones: 9'^ provides more rapid convergence 
than using the standard estimator 9 when averaging over samples for prediction on test 
documents, and (fP improves perplexity due to implicit averaging. The rest of the plots in 
Appendix |D] are in accordance with the above observations. 


5.4 Word Association: (pf \rs (p 


In this experiment, we compare cpf with (p on a. word association task. In word association, 
a given word, called the norm or cue word, is associated with a number of semantically 


related words, called targets or associates. We consider the data set provided by Nelson et al. 


(2004), which contains a set of 5,019 norm words and for each of them a set of associate 


words provided by human annotators. We aim to see how, given a specihc cue word, the 
two LDA estimators rank the corresponding associates, in order to assess which of the two 
performs better at predicting the targets, in terms of the median difference in ranks. The 
word association task provides a useful benchmark for evaluating the extent to which the 


topic representations are a good model of human semantic representation (Griffiths et al. 


2007). 


We largely follow the same setup as that described by Griffiths et al. (2007), pages 


220-224. Specifically, we use the TASA data set, removing only stopwords and word types 
with fewer than hve and more than ten thousand occurrences in the corpus, resulting in 
a vocabulary of 26,186 word types. There are 4,506 norms that belong to the obtained 
vocabulary. Subsequently, we train an LDA model on the corpus for K = 50,100, 500,1000 
and for 250 iterations, obtaining a single point estimate for (pP and (p for each value of K at 
the end of the procedure. Our goal is to see how each of the two estimators rank the hve 
most commonly associated responses to each of the norm words, as given by human subjects. 


Similarly to Griffiths et al. (2007) (refer to p.220 and Appendix B, p.244), in order to assess 
semantic association we use the conditional probability p(tt) 2 |'u^i) where wi stands for a given 
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(a) BioASQ 




(b) New York Times 


Figure 1: Perplexity against the number of samples averaged over, for the CGSp estimators 
and standard CGS. Results are taken by averaging over 5 different runs. Samples are taken 
after a burn-in period of 50 iterations. 


cue word and W 2 denotes each of the rest of the vocabulary word types. Supposing that u;i 
and W 2 belong to the same topic z, and assuming a uniform prior on z, we hav^ 


P{W2\WI) = Y,Pi^2\z)p{z\wi) = 


'^piw2\z) 


pjwilz) 

Y,p{wi\z') 


(57) 


Next, we consider the first five associate words of the norm wi and obtain the rank for 
each of them according to the probabilities p{w 2 \wi), for each of (jf and cj). In Figure]^ we 

5. A uniform prior p{z) follows if z is understood to be the first word in a document under the collapsed 
LDA urn model, with symmetric hyperparameters a. With asymmetric hyperparameters, p{z = k) oc a^- 
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(a) Reuters-21578 




(b) TASA 

Figure 2: Perplexity against the number of samples averaged over, for the CGSp methods 
and standard CGS. Results are taken by averaging over 5 different runs. Samples are taken 
after a burn-in period of 50 iterations. 


report the median difference in ranks, rank^f, — ranktpp against word frequency, a positive 
value indicating an advantage for (fP. To enhance readability, we have grouped the norms 
according to their frequencies within the corpus, into 10 categories. 

We observe two steady trends. First, ipP clearly outperforms cj) for rare norms (the first 
category), which is also the most populous (3,814 norms). For more frequent norms, (j:P 
and (j) behave almost on par. This trend reveals the distinctive beneht of our proposed 
estimator compared to its counterpart, especially if we take into account that word type 
frequencies typically follow a power law like distribution in document collections. A second 
trend, consistent with our other experiments, relates to the number of topics K: as K 
increases and the complexity of the LDA model correspondingly increases, the advantage 
of (pP becomes wider compared to (p. We similarly observe that cpP provides a benefit for 
increasingly frequent words as K increases. 
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(a) K = 50 


(b) K = 100 


800 



0 



315 944 1573 2202 2831 3460 4089 4718 5347 5976 

Norm Frequency (categories midpoints) 


315 944 1573 2202 2831 3460 4089 4718 5347 5976 

Norm Frequency (categories midpoints) 


(c) K = 500 


(d) K = 1000 



Category 


(e) Norms per category. 

Figure 3: Median difference in ranks produced by and cj) estimators for the first five 


associate words, for a set of 4,506 norms (Nelson et al., 2004). The estimators were computed 


after training LDA on the TASA corpus. A positive value indicates an advantage for 


These results are consistent with our analysis in Section which showed that performs 
implicit averaging in order to improve the stability of the topics. The nuanced uncertainty 
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information, and corresponding stability, provided by (ff is most important for less frequent 
words, and for larger K, for which the posterior uncertainty in the (smaller) count values n^v 
per (word,topic) pair is most consequential. For larger (word, topic) counts the expected 
counts estimated by (Equation [47] ) approach the observed counts computed by (j) due to 
the law of large numbers. Consequently, we typically observe little difference in the top-words 
lists generated by (jf and (j), which are often used to assess the interpretability of the topics. 
Instead, the results of this experiment suggest that (j:P conveys the most benefit for tasks 
that depend on the word/topic probabilities for the less frequent to moderately frequent 
words, as in word association, and any other task which requires semantic representations of 
the words. 


5.5 Comparison between CGS, CGSp and CVBO 

In this experiment we compared the CGS, CGSp and CVBO algorithms in terms of perplexity, 
across different numbers of topics. The motivation here was to examine how the CGSp 
method would perform compared to the rest of the algorithms, in a variety of configurations 
and data sets. 

In Appendix we report also the results of this experiment, comparing the three 
aforementioned algorithms with Variational Bayes (VB). Since the results for VB were 
steadily worse, we excluded them from Figure]^ to allow for easier comparison among CVBO, 
CGS and CGSp. Asuncion et al. ( 2009| > note that the disadvantage of VB versus the other 
algorithms can potentially be mitigated via hyperparameter learning. 

For this experiment we followed a similar approach to the one described by Asuncion 


et al. (2009). During training we ran each chain for 200 iterations to obtain a single point 
estimate of the (j) and distributions. During prediction, using the previously estimated (j), 
we ran 200 iterations from one chain for the first half of each document to obtain an estimate 
of 6. We followed the same approach to compute 9^, using cj^ in that casej^ We then 
computed perplexity on the second half of each document. The and hyperparameters 
were fixed across all data sets to uniform values, 0.1 and 0.01 respectively. CVBO was run 
with the same parameterization. 

Figure]^ shows the perplexity results for all data sets across different settings (20, 50, 
100, 200, 300, 500 and 1000 topics) for all methods. Each topics configuration was run for 
five times, obtaining the average perplexity value. There is a strong interaction between 
the perplexity scores, the data sets, and the number of topics, probably due to the diverse 
statistics of the data sets, such as the average document length and the average number of 
features (i.e. word types) per document (see Tablein Section 5.1), that characterize the 
data sets. All algorithms achieve their lowest perplexity values at around 200 topics. 

Despite the peculiarities of individual data sets, we can identify a broad general trend in 
these results. CVBO outperforms (marginally in three out of four cases) the other methods 
in all data sets for lower topic number values: for the BioASQ and TASA data sets this 
happens up to 50 topics, for the New New York Times data set up to 100 topics, while for 
Reuters-21578 up to 200 topics. As the number of topics increases though, this behavior 
is reversed and CGS and CGSp have the upper hand after the aforementioned numbers of 
topics. 


6. When referring to CGSp, we mean that the -1- 9^ estimators were used. 
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(a) BioASQ (b) New York Times 




(c) Reuters-21578 (d) TASA 


Figure 4: Perplexity against number of topics for the CGSp method, standard CGS and 
GVBO. Results are taken by averaging over 5 different runs. 


A possible explanation for this observation is related to the deterministic nature of 
GVBO; compared to its stochastic counterpart CGS, GVBO is more prone to getting stuck in 
local maxima. As the number of topics increases, we expect the hypothesis space to grow 
bigger, making it more difficult for GVBO to find a global optimum. CGS on the other hand, 
can exploit its stochastic nature to escape local maxima and converge to a better global 
representation of the data. Therefore, GVBO seems to be better suited for configurations 
with a small number of topics (in which case the fact that GVBO converges a lot faster than 
CGS as shown by Asuncion et al. (20091 is an additional advantage), while CGS fits better 
in the opposite case. Similar to the previous section, when comparing CGS and CGSp we 
can observe a slight but steady advantage of our method over the standard estimator, across 
all experimental settings. 
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(d) 200 topics 


(e) 500 topics 


(f) 1000 topics 


Figure 5: CGS, CGSp and GVBO convergence in terms of log likelihood during 400 training 
iterations in the TASA subset for different numbers of topics, and were hxed to 0.1 
and 0.01 respectively. 


5.6 Convergence Dnring Estimation for CGS, CGSp and CVBO 


The perplexity results of the previous section motivated us to further investigate the 
convergence behavior of CGS, CGSp and CVBO algorithms. In particular, we wanted to 
know the conditions under which CGS (and our modihcation of it) outperformed CVBO 
and vice versa. Furthermore, we wanted to see how these differences evolved as the two 
algorithms converged upon their respective solutions. 

Figure [^presents a comparison of CVBO, CGS and CGSp (during the estimation phase) 
on the TASA data set in terms of their log likelihoods over iterations, under different numbers 
of topics. To calculate the log likelihood for CGSp we used the and 6^ estimators. For 
each algorithm and each topics conhguration, we performed hve runs and obtained the mean 
log-likelihood values. 


The results seem to validate our previous observations with CVBO converging faster 
and to a higher log likelihood than CGS and CGSp when \K\ < 100. As the number of 
topics grows and the number of training iterations increases, though, the performance of 
CGS and CGSp catches up, and after 100 topics, they outperform CVBO. This is consistent 
with our hypothesis that CVBO performance is worse than that of CGS and CGSp as the 
hypothesis space grows, due to it being a deterministic algorithm and getting stuck in local 


maxima. These results seem to be consistent with the observations made by Teh et al. (2006 
Section 4) and could eventually emerge from either (or both) of the following factors: 
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Method 


Data sets 




BioASQ 

New York Times Reuters-21578 

TASA 

(t> + e 

-1.921 

MALLET 

-6.770 

-0.590 

-2.572 

(fp + e 

-1.912 

-6.763 

-0.585 

-2.567 

4> + ep 

-1.911 

-6.762 

-0.584 

-2.562 

^ + 9P 

-1.903 

-6.751 

-0.580 

-2.560 

(p + e 

-3.22 

WarpLDA 

-7.092 

-1.014 

-2.730 

(jf + 6 

-3.207 

-7.081 

-0.997 

-2.724 

cj) + ep 

-3.208 

-7.079 

-0.992 

-2.718 

(jp + ep 

-3.203 

-7.074 

-0.990 

-2.716 


Table 4: CGSp vs standard CGS estimator on MALLET and WarpLDA in terms of the Log 
Likelihood. All values are xlO^ and represent mean values across five runs. 


Variational inference methods (VB, CVB and GVBO) approximate the true posterior 


with a more tractable variational distribution (Blei et ah, 2003; Teh et al. 


Asuncion et ah, 2009). Furthermore, in the case of GVBO, as illustrated by (Foulds 


2014 


2006 


Section 4.4.1), the algorithm is a result of three consecutive approximations of 
the initial problem. On the other hand, GGS performs exact collapsed Gibbs updates 
and will eventually sample from the true posterior. 


• GVBO seems to be unable to avoid local maxima due to its deterministic nature, 
especially as the hypothesis space grows bigger, while CGS takes advantage of its 
stochastic nature in order to avoid them. 


A secondary observation is that CGSp increases its advantage over CGS as the number of 
topics grows. A final noteworthy observation from Figure is that GVBO exhibits very 
slow convergence behavior in the first 5-10 iterations, before reaching a regime of rapid 
convergence. While this phenomenon was not immediately visible in the results reported by 


Asuncion et al. (2009) (Figure 2), this apparent discrepancy is readily explained: in those 


results a linear scale was used on the X-axis, with data points plotted only every 20 iterations. 
This slow convergence behavior is in alignment with the well-known convergence speed issues 
for the closely related EM algorithm, which can also be understood as a variational method. 
For EM, Salakhutdinov et al. (2003) showed that convergence is very slow when the missing 
data distributions are highly uncertain. This is likely to be the case in the initial stages of 
GVBO and CGS, where the count histograms are relatively flat. We hypothesize that the hard 
assignments of CGS allow the algorithm to escape the slow-converging high-entropy regime 
more rapidly than GVBO, leading to better convergence behavior in the early iterations, 
before the determinism of GVBO once again gives it the upper hand (at least, with a small 
number of topics). 
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Duration(seconds) 

(c) K = 500 



Figure 6: Log Likelihood vs duration for CGSp against the standard CGS estimators on 
TASA, with WarpLDA. The time duration for each point in the plot corresponds to the 
running time of WarpLDA for different numbers of iterations (50, 100, 200, 500, 1000, 2000, 
5000), plus the overhead needed to compute each of the estimators at that timestep. 


5.7 CGSp with MALLET and WarpLDA 


In order to exhibit how straightforward it is to plug our methods into already existing CGS 
LDA implementations and, most importantly, in modern CGS variants, we have employed 


CGSp on top of the MALLET software package (McCahum, 2002b) and the WarpLDA 


(Chen et ah, 2016) implementation. In all cases, we have used K = 100, a = 0.1, /3 = 0.01. 


After running the respective implementations on the four datasets for 200 iterations, we have 
obtained a single sample for the z-assignments. Subsequently, and after calculating the nkw 
and Udk counters, we have computed the standard estimator and our proposed estimators. 
Table depicts the results in terms of the training data log-likelihood, showing a steady 
advantage of our methods compared to the standard estimator. 
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5.8 Run-time Overhead: CGSp vs CGS 

In the previous experiments, we have observed that our proposed estimators steadily outper¬ 
formed the standard CGS estimators. Nevertheless, our methods entail a computational 
time overhead since they require computing the full CGS probability distributions, which is 
roughly equivalent to a single dense standard CGS update. This overhead is most noticeable 
when modern sparse CGS implementations are used with a large number of topics K, since 
these methods can be executed in time sublinear in K, unlike the dense CGS update. In 
this section, we study the trade-off between the improved performance and the extra cost of 
our methods, as compared with the standard estimators. In particular, we are interested 
in examining the performance of our methods when paired with the fastest known GGS 
implementation, WarpLDA. 

In most applications of topic models, especially when dealing with large-scale data, LDA 
inference is run on a given corpus for a number of iterations and a single point estimate 
of the </> and 9 parameters is calculated at the end of the procedure, so we focus on that 
scenario for our experiment. When considering performance versus running time, based on 
our previous experiments we expect GGSp to attain higher log likelihood values, but with 
the extra overhead of the estimators calculation, while the standard GGS estimators will 
need more iterations to converge to those same log likelihood values, without the overhead. 
Therefore, we are interested to see if, and at which point during the process of training an 
LDA model, employing the GGSp estimators will be more beneficial time-wise than using 
the standard estimators. 

In this experiment, reported in Figure we run WarpLDA on the TASA data set for 
K = 50,100, 500,1000 and plot the training data log likelihood of GGS and GGSp against 
time duration. In Appendix]^ we report the relevant results for the rest of the data sets. The 
time duration for each point in the plot corresponds to the running time of WarpLDA for 
different numbers of iterations (50, 100, 200, 500, 1000, 2000, 5000), plus the overhead needed 
to compute each of the estimators at that timestep. We observe that in the early iterations, 
while LDA is still converging, the standard estimators attain a higher log likelihood value 
faster than the CGSp ones for K = 500,1000 (the two estimators are almost identical during 
burn-in for K = 50,100). This tendency is reversed after convergence of the procedure, with 
CGSp attaining a higher log likelihood value faster than CGS for all topic configurations. 
In most cases we are interested in obtaining an estimate of the LDA parameters well after 
convergence, so we consider that the CGSp estimators are to be preferred over the standard 
CGS ones, even under limited time resources. We also see that convergence takes longer 
with more topics, and the time before CGSp overtakes CGS is correspondingly longer, but 
the improvement in log-likelihood for CGSp over CGS becomes greater. 

6. Multi-Label Learning Experiments — Prior-LDA 

Another important context for comparing different approaches to LDA prediction is in a 
multi-label, supervised setting. Here, we considered a multi-label learning extension of 
LDA—Prior-LDA—and used it as a basis for comparisons of model predictions. Following the 
same organization as the previous section, we present the data sets, the evaluation measures, 
the experimental setup, and lastly the results with a discussion of their implications. 
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Documents Labels 


Data set 

I^Train 

Drest 

Nd 

K 

\l^d\ 

Avg. Freq. 

V 

Delicious 

12,910 

3,181 

16.72 

983 

19.06 

250.34 

500 

BioASQ 

20,000 

10,000 

113.55 

1,032 

9.05 

174.97 

135,186 

Bibtex 

4,880 

2,515 

54.06 

159 

2.38 

73.05 

1,806 

Bookmarks 

70,285 

17,570 

124.26 

208 

2.03 

682.90 

2,136 


Table 5: Statistics for the data sets used in the supervised learning experiments. Column 
“Avg. Freq.” refers to the average label frequency. All hgures concerning labels and word 
types are given for the respective training sets. 


6.1 Data Sets 


In this series of experiments we kept the BioASQ data set and used three additional labeled 
ones: Delicious, Bibtex and Bookmarks. Table presents their main statistics. These data 
sets were chosen as representative of the diversity of real-world data, where there are often 
many labels, and sometimes not many word tokens per training instance. 

Delicious, Bibtex and Bookmark^ are three widely used multi-label data sets (Katakis 
et ah, 2008; Tsoumakas et al.||2008[ ). We did not perform any further preprocessing of these 
data sets. A notable aspect of Delicious is that it contains very few word tokens per instance 
both in the training and test sets, making accurate classification difficult. Concerning the 
BioASQ data set, we used the same corpus as for the unsupervised learning experiments 
and the same preprocessing procedure. We did not use the entire labelset of the original 
data set, filtering out labels appearing in less than 40 instances. 


6.2 Evaluation Metrics 


We considered three widely-used performance measures: the micro-averaged FI measure 
(Micro-F for brevity) and the macro-averaged FI measnre (Macro-F for brevity) and the 


example-based FI measure (Tsoumakas et al. 2010). These measures are a weighted function 


of precision and recall, and emphasize the need for a model to perform well in terms of both 
of these underlying measures. The Macro-F score is the average of the Fl-scores that are 
achieved across all labels, while the example-based FI score is the respective average of the 
FI score across all test documents. The Micro-F score is the average FI score weighted by 
each label’s frequency. Macro-F tends to emphasize performance on infrequent labels, while 
Micro-F tends to emphasize performance on frequent labels. 


6.3 Setup of Prior-LDA (CGS and CVBO) 

The Prior-LDA model takes into account the relative label frequencies in the corpus. Test 
documents are biased to assign words to more frequent labels by using a non-symmetric 
vector ak on the 9 distributions. During training we kept the parameter symmetrical 
and set it to: 

7. http://mulan.sourceforge.net/datasets-mlc.html 
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50 

while during prediction we incorporated the label frequencies by setting it to: 

— /fe I ^ /'CTQ'i 

E/A: ^ 

with /fc standing for the frequency of label k. The /3^ parameter was set to 0.1 across all 
data sets. For all data sets, we calculated performance under two different configurations: 

• First, we wanted to emulate a scenario where only one sample can be afforded both 
for training and testing. In this case, during training we ran only one Markov chain 
and took a single estimate from the same chain for both (j) and cj/’. Similarly, during 
testing, we ran one chain with the estimated (p and one with the estimated (ff to 
obtain a single estimate for 9 and 6^ from each chain. In this manner, we obtained the 
predictions for all combinations of 4> and 9 estimators. 

• In the second configuration, we used 5 Markov chains and 30 samples from each 
chain, both during training and testing, following the same approach as above to 
obtain predictions from all four combinations of the cp and 9 estimators. In this case, 
the motivation was to examine performance with full convergence of the relevant 
estimators. 



In both training and prediction, we used a burn-in period of 50 iterations and a lag of 5 
iterations between each sample. All samples from all chains were averaged to obtain the 
respective parameter estimates for each method. 

For CVBO we used the same setup as the one described above. However, as CVBO is a 
deterministic algorithm that does not benefit greatly from averaging samples, in the second 
configuration each chain was initialized with a different random order of the documents and 
a different random initialization of the 7 parameters. 

Finally, in order to obtain a hard assignment of labels to documents from the 9 distri¬ 
butions, we used the Meta-Labeler approach (Tang et ah, 2009). In particular, we used a 


linear regression model to predict the number of labels per instance. To train this model, 
the same feature space as for the LLDA models was employed and the LibLinear package 
(Fan et al., 2008) was used for the implementation. 


6.4 Results and Discussion 

Tables show the Micro-F, the Macro-F and the example-based FI results respectively 
for all algorithms on the four data sets. We additionally show the average rank of each 
model, in terms of how it performs among the models on average across the four data 
sets. All results represent averages over five runs of the respective Gibbs sampler (or CVBO 
algorithm). Also, to test statistical significance of the differences, we performed a one sample 
z-test for proportions at a significance level of 0.05. 

First, let us consider the results for when only a single sample was used. A first remark 
is that CGSp and cp + 9^ are significantly better than CGS and (pP + 9 over all data sets 
and evaluation measures, confirming 9P'’s superiority over 9. Secondly, CVBO has a clear 
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Method 

Delicious 

BioASQ 

Bibtex 

Bookmarks 

Avg Rank 



1 MC X 

: 1 sample 



CGS 

0.2126±0.0034A 

0.3075±0.0041A 

0.2388±0.0024A 

0.1546±0.0042A 

4.5 

0 + 0^ 

0.2472±0.0038A 

0.3616±0.0037A 

0.2985L0.0029 

0.1696±0.0036A 

2.5 

(j)P + 9 

0.2127±0.0027A 

0.3081±0.0038A 

0.2337±0.0026A 

0.1542±0.0046A 

4.5 

CGSp 

0.2531±0.0032A 

0.3623±0.0041A 

0.2900±0.0021 

0.1712±0.0027A 

2.25 

CVBO 

0.2761A0.0015 

0.4122A0.0032 

0.2907A0.0014 

0.1810A0.0032 

1.25 



5 MC X 

30 samples 



CGS 

0.2699±0.0017A 

0.4509±0.0021A 

0.2873±0.0011A 

0.1819±0.0011A 

4.25 

(l) + 0P 

0.3184±0.0015 

0.4628±0.0019 

0.3704A0.0013 

0.2035±0.0018 

1.75 

(j>p + e 

0.2699±0.0019A 

0.4509±0.0024A 

0.2842±0.0015A 

0.1820±0.0023A 

3.75 

CGSp 

0.3189±0.0012 

0.4633A0.0009 

0.36277±0.0005 

0.2039L0.0024 

1.25 

CVBO 

0.2773±0.0008A 

0.4125±0.0006A 

0.2910±0.0002A 

0.1811±0.0004A 

4 


Table 6 : Results for the Micro-Fl measure comparing the Prior-LDA models. A A symbol 
represents statistically significant difference compared to the best performing model, at a 
level of 0.05. 


advantage over the rest of the methods, in the majority of the data sets and evaluation 
measures. Only CGSp and (f) + 6^ manage to outperform it in three out of twelve cases (four 
data sets and three evaluation measures). 

At first glance, these results suggest that, when only one sample from one Markov chain 
can be afforded, CVBO should be perhaps considered first, followed by CGSp and cj) + 9^ 
as alternatives. In order to more carefully evaluate the two algorithms for scenarios where 
computational time is a critical parameter, in Appendixj^we performed a short investigation 
in order to study if and when CGSp performance surpasses that of CVBO. Specifically, we 
plotted performance (Micro-F, Macro-F, example-based FI) against the number of samples 
taken, for all four data sets. The results show that CGSp manages to outperform CVBO in 
seven out of twelve cases, by taking only two samples from one Markov chain, while, after 
taking ten samples from one chain, CGSp outperforms CVBO in eleven out of twelve cases. 
Considering that there exist a number of approaches for CCS that exploit sparsity to speed 
up LDA estimation and prediction (Porteous et ah, 2008; Yao et abj 2009 Li et al., 2014), 


while this is not the case for CVBO, we could claim that in many cases CGSp is equally fit 
or even superior to CVBO, in terms of total computational timej^ 

Let us now focus on the scenario where we have averaged over multiple samples and 
multiple Markov chains to calculate predictions. In this case, model performance shifts in 
favor of the CGS-based predictions. Whereas the CCS algorithm benefits from averaging 
samples—due to the fact that each sample is ideally a draw from the posterior distribution— 
CVBO is deterministic, and achieves very small improvements (if any), presumably since 
it tends to converge to the same specific maximum. When five chains are used, CGSp and 


8. Note that the stochastic variant SCVBO (Foulds et al. 20131 is much more scalable than CGS in the 


amount of data, but less scalable in the number of topics due to dense updates. 
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Method 

Delicious 

BioASQ 

Bibtex 

Bookmarks 

Avg 

Rank 

1 MC X 1 sample 

CGS 

0.0337±0.0012)A 

0.1767±0.0039A 

0.1582±0.0033A 

0.0928±0.0017A 

4.5 

4> + 9P 

0.0597T0.0020 

0.2160±0.0031A 

0.2019±0.0030A 

0.1009±0.0023A 

2.25 

(pp + e 

0.0332±0.0015A 

0.1769±0.0040A 

0.1548±0.0031A 

0.0930±0.0019A 

4.5 

CGSp 

0.0585±0.0018 

0.2168±0.0036A 

0.1946±0.0036A 

0.1015±0.0016A 

2.25 

CVBO 

0.0466±0.0010A 

0.3039±0.0014 

0.2232A0.0021 

0.1169±0.0008 

1.5 

5 MC X 30 samples 

CGS 

0.0373±0.0006A 

0.3011±0.0014A 

0.2144±0.0013A 

0.1167±0.0009A 

4.5 

(p + QP 

0.0874±0.0004 

0.3138±0.0016 

0.2693±0.0011 

0.1294±0.0008 

1.5 

(pp + e 

0.0362±0.0011A 

0.3016±0.0015A 

0.2086±0.0017A 

0.1168±0.0013A 

4.5 

CGSp 

0.0855±0.0012 

0.3146±0.0016 

0.2579±0.0021 

0.1299±0.0011 

1.5 

CVBO 

0.0481±0.0003A 

0.3075±0.0005 

0.2237±0.0008A 

0.1173± 0.0006A 

3 


Table 7: Results for the Macro-F measure. A A symbol represents statistically significant 
difference compared to the best performing model, at a level of 0.05. 


(j) + 0^ are significantly better than CVBO in all data sets and for all evaluation measures. 
CGS and + 9 on the other hand, seem to achieve equivalent performance to CVBO overall. 

Among the CGS-based methods, we observe that, even if 9 and 9^ should theoretically 
converge to the same solution (at least in an unsupervised learning context) even after a 
total of 150 samples (30 samples from hve chains) their difference is significantly large for all 
evaluation measures and all data sets (the BioASQ data set represents partly an exception 
with smaller differences between the four methods). Another important note concerns the 
comparison between models predicting with (jp and those predicting with cj). Results are 
mixed in this case, with models predicting with (fp having a slight, but not statistically 
significant advantage. 

Overall, the supervised experiments show a consistent advantage of the CGSp and 
(f> + 9P estimators over the respective standard CGS estimator. Also, CGSp and (p + 9^ are 
competitive with GVBO in scenarios where only one sample can be afforded. Finally, when 
averaging over many samples and Markov chains, CGSp and 4> + 9^ distinctly outperform 
CVBO. 

7. Conclusions 

In this work we proposed a novel method for accurately estimating the document-wise and 
topic-wise LDA parameters from collapsed Gibbs samples by implicitly averaging over many 
samples, leading to improved estimation with little extra computational cost. Our algorithm 
can be interpreted as using soft clustering information, as in the CVBO collapsed variational 
inference algorithm, but in a Markov chain Monte Carlo setting. This allows us to use all of 
the uncertainty information encoded in the dense Gibbs sampling transition probabilities to 
recover the parameters from a GGS sample, while still leveraging the sparsity inherent in the 
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Method 

Delicious 

BioASQ 

Bibtex 

Bookmarks 

Avg Rank 



1 MC X 

: 1 sample 



CGS 

0.2047±0.0019A 

0.2930±0.0037A 

0.2546±0.0031A 

0.2007T0.0031A 

4.25 

0 + 0P 

0.2389±0.0023A 

0.3400±0.0042 

0.3127T0.0036 

0.2152±0.0032A 

2.5 

(j)P + e 

0.2035±0.0014A 

0.2942±0.0034A 

0.2472±0.0028A 

0.1999±0.0029A 

4.75 

CGSp 

0.2444±0.0019A 

0.3402±0.0037 

0.3048±0.0033 

0.2166±0.0031A 

2 

CVBO 

0.2672±0.0015 

0.3504±0.0018 

0.2995±0.0017A 

0.2225T0.0021 

1.5 



5 MC X 

30 samples 



CGS 

0.2625±0.0007A 

0.4180±0.0029 

0.2958±0.0023A 

0.2226±0.0005A 

4.25 

(l) + 0P 

0.3118±0.0011 

0.4269±0.0020 

0.3807T0.0011 

0.2424±0.0016 

1.5 

(j>p + e 

0.2610±0.0012A 

0.4181±0.0023 

0.2922±0.0016A 

0.2229±0.0007A 

4 

CGSp 

0.3086±0.0007 

0.4276T0.0025 

0.3733±0.0021 

0.2431T0.0009 

1.5 

CVBO 

0.2690±0.0004A 

0.3568±0.0021A 

0.2997±0.0005A 

0.2227±0.0003A 

3.75 


Table 8 : Results for the example-based F-measure. A A symbol represents statistically 
significant difference compared to the best performing model, at a level of 0.05. 


CGS algorithm during the training process. We extensively investigated the performance 
of the proposed estimators for both unsupervised and supervised topic models, and with 
comparison to CVBO. Our results demonstrate that our 6^ and cjf estimators are more 
effective than the standard estimator in terms of perplexity, and that 9^ also improves 
multi-label classification performance for the supervised Prior-LDA model. 


Our experiments showed an advantage of our CGS-based estimators over CVBO in the 
majority of cases, and also revealed the consequences of the deterministic nature of that 
algorithm. In the unsupervised scenario, CVBO has a clear advantage for LDA models with 
few topics, while this phenomenon is reversed as the posterior landscape becomes more 
complicated with a greater number of topics. In the multi-label learning case, CVBO has 
the upper hand when only one sample can be afforded, while it is decisively outperformed 
by CGSp when averaging over multiple samples. The success of our methods illustrates the 
value of averaging over multiple MGMC samples for LDA, even when only a point estimate 
is required. In future work, we anticipate that our ideas can be adapted to other models 
with collapsed representations, including latent variable blockmodels for social networks 


and other network data (Kemp et ah, 2006), and hidden Markov models (Goldwater and 


Griffiths, 2007). 
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Figure 7: Hard and soft document-topic counts absolute difference against the number of 
iterations. The counts are normalized to 1 before taking their difference, hence the difference 
taking values between 0 and 1. 
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Appendix A. Hard Counts vs Soft Counts 

Algorithmically (refer to Algorithm Q, the difference between the 9 and 9^ estimators lies in 
the difference between the hard counts Udk and the soft counts given by j, k), p 

being the sampling update equation of CCS given in equation In this experiment, we 
examine how similar are the two counts. For this reason, we have run CCS on the four data 
sets setting K = 100, a = 0.1, j3 = 0.01 for 50 iterations and calculated the sum of the 
absolute differences between the two counts, or, in other words: 
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Topics Topics 

(a) iteration = 50 (b) iteration = 100 




Topics Topics 

(c) iteration = 500 (d) iteration = 1000 

Figure 8: Absolute difference between the hard and the soft counts across the K different 
topics, for iteration 50, 100 and 1000, for the first document of the TASA collection. 


Figure shows the absolute sum of differences across the number of iterations. We can 
observe a steady difference among the two counts, that diminishes but does not go away 
with the number of iterations. Furthermore, in Figure we depict for the TASA data set 
and the first document of the collection, the absolute difference between the hard and the 
soft counts across the K different topics, for iteration 50, 100 and 1000. Similarly to our 
previous observations, we see that the two counts are similar, but present a small but steady 
difference across topics, for all iterations. 


41 






















































































Papanikolaou, Foulds, Rubin and Tsoumakas 


Appendix B. Bounding the Approximation for 

We wish to bound the error of the approximation step used in the derivation for the (f)p 


technique, given in Equation 47 




E. 


'p{z| 








'p{z| W.^.zeZoitbs (z^*b) 


Assume that all > 0. Then, starting from the left hand side of the above, we have: 
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Appendix C. Empirical validation of the approximation for cjf 


When deriving our (fi^ estimator, in order to prove that represents the expected value of 
the standard CGS cj) estimator, we take the following approximation; we consider that the 
denominator of Equation [45| is not changing, between two adjacent Gibbs samples. Since Zdjk 
is within ±1 of when taking adjacent Gibbs samples and in a typical corpus Uk » 1, 
we can assume that ~ ± 1. Using this approximation, we are able to consider that the 

denominator in the left hand of Equation 47 is a constant when taking an adjacent sample. 
Hence, the expected value of the fraction will be the expected value of the numerator, divided 
by the denominator. 

In order to empirically validate this argument, we considered the following experiment: 
We have run GGS on the New York Times data set for 100 iterations and for K = 100. Next 
we computed explicitly, the left and the right sides of Equation for all topics and word 
types. Subsequently, we have measured the accuracy of the approximation of Equation [47] 
for a varying size of documents (and therefore n^), by plotting the absolute difference |left 
hand side - right hand side| for each topic, as a function of the size of I?, or where 


E DTrain \ ' i Q 

d=l z^j-.Wd i=v '^d.jk -r P; 


a = E, 




J-Wd,j = 


E ^Train \ -v , , I \ Q 

d=l 2^j = i 2^V' = 1 


Y.dlr" Ylj-.Wdj=v Ep(z\v,,f3,z&ZGitU^(i)))[^djk] + Pv 

E ^Train ^(0 I o 

d=l '^djk Z^v'=l 


(63) 


Eigure plots the results from the above experiment. As expected, by increasing the 
considered subset of documents and therefore n^, the error of the approximation for each 
topic approaches zero. 


Appendix D. 

Eigures [To| - [T3| present additional results for the experiment of Section using different 
topic conhgurations {K = 20, 50, 500) for each of the four data sets (BioASQ, New York 
Times, Reuters-21578, TASA) respectively. These results are aligned with the conclusions 
made in Section |5.3| and are reported here as further empirical validation of the theory 
behind GGSp. 


Appendix E. 

In this section we report in Figure [T4| additionally to the results of Section 5.5 the perplexity 
across different topic configurations for VB against GVBO, GGS and GGSp. We employed 
the Ida-c package, publicly available at https://github.coni/blei-lab/lda-c, keeping all 
parameters to default for the Variational EM procedure. 

Apart from the New York Times data set, where VB manages to outperform GVBO for 
K > 200 and the GGS algorithms for K = 200, the algorithm performs worse than other 
methods in all other cases. These results seem to be aligned with previous results in the 


literature (Teh et al., 2006; Asuncion et al., 2009) 
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3 


X 10 



Documents 


Figure 9: Absolute difference, per topic, between the left and right hand of Equation 47 


As the number of documents increases, the difference approaches 0. Here, o, b are given in 
Equation 63 and the sum is divided by K in order to obtain the per-topic differences. 


Appendix F. 

We report in Figures [T5]-[T7| additional results for the rest of the data sets, for our experiment 
of section |5.8[ regarding the run-time overhead of the CGSp estimators. The results validate 
the conclusions drawn in section [5^ 


Appendix G. 


We report the results of an additional experiment that we performed in order to investigate 
how many samples CGSp, cj) + 6^ and (jp + 9 would require to outperform CVBO (Figures [T^ 
21). The same parameter setup was used as in the rest of the multi-label experiments of 


Section [H 
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(a) A = 20 




(b) A = 50 




(c) K = 500 


Figure 10: Perplexity against the number of samples for the CGSp method and standard 
CGS for the BioASQ data set. Results are taken by averaging over 5 different runs. 
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(a) K = 2Q 
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(b) ib = 50 




(c) K = 500 

Figure 11; Perplexity against the number of samples for the CGSp methods and standard 
CGS for the New York Times data set. Results are taken by averaging over 5 different runs. 
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(a) A = 20 




(b) A = 50 




(c) K = 500 

Figure 12; Perplexity against the number of samples for the CGSp methods and standard 
CGS for the Reuters-21578 data set. Results are taken by averaging over 5 different runs. 
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(a) K = 2Q 




(b) ib = 50 




(c) K = 500 

Figure 13; Perplexity against the number of samples for the CGSp methods and standard 
CGS for the TASA data set. Results are taken by averaging over 5 different runs. 
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(d) TASA 


Figure 14: Perplexity against number of topics for the CGSp method, standard CGS, GVBO 
and VB. Results are taken by averaging over 5 different runs. 
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Figure 15: Log Likelihood vs duration for CGSp against the standard CGS estimators on 
BioASQ, with WarpLDA. The time duration for each point in the plot corresponds to the 
running time of WarpLDA for different numbers of iterations (50, 100, 200, 500, 1000, 2000, 
5000), plus the overhead needed to compute each of the estimators at that timestep. 
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Figure 16: Log Likelihood vs duration for CGSp against the standard CGS estimators on 
Reuters-21578, with WarpLDA. The time duration for each point in the plot corresponds to 
the running time of WarpLDA for different numbers of iterations (50, 100, 200, 500, 1000, 
2000, 5000), plus the overhead needed to compute each of the estimators at that timestep. 
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Figure 17: Log Likelihood vs duration for CGSp against the standard CGS estimators on 
New York Times, with WarpLDA. The time duration for each point in the plot corresponds 
to the running time of WarpLDA for different numbers of iterations (50, 100, 200, 500, 1000, 
2000, 5000), plus the overhead needed to compute each of the estimators at that timestep. 
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Figure 18; Delicious. 


Figure 19: BioASQ 
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